fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849
fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849sayakmaity wants to merge 5 commits into
Conversation
|
SGP consumer wiring that populates |
…infra_config().env
|
Follow-up commit (
|
|
Review follow-up ( Root cause confirmed for both:
Fix:
Verified with |
… leakage across emissions
|
Round 3 review follow-up: [P3] Regenerated the checked-in fallback template ( [Note] Fixed the mutable [Note] |
Problem
model-engine's Datadog
envtag is derived from the build-time Helm value.Values.context, so every deployment reports the sameenv(e.g.production) regardless of which cluster/environment it actually runs in. This affects two independent surfaces:DD_ENVand thetags.datadoghq.com/envlabel both come straight from.Values.context.DD_ENVis baked into the renderedservice_template_config_mapfrom.Values.contextathelm templatetime, so it's frozen in the image and identical for every cluster the image is deployed to (unlike the neighboringDD_SERVICE/DD_VERSION, which are already${...}runtime-substituted).Approach
Introduce a dedicated, optional
datadog.envvalue, decoupled fromcontext(which is overloaded — it also selects theservice_template_config_mapvariant, drives acontext == "circleci"conditional, and sets non-DD labels, so it must not be repurposed). The fix mirrors the existingGIT_TAG/DD_VERSIONsplit exactly:DD_ENV/ env label ={{ .Values.datadog.env | default .Values.context }}(Helm value, backwards-compatible: falls back tocontextwhen unset).DD_ENV=${DD_ENV}runtime substitution, populated at endpoint-creation from the gateway's ownDD_ENV(so launched pods inherit the gateway's per-cluster env). This follows the same pattern as${DD_TRACE_ENABLED}/${GIT_TAG}.Changes
Chart (
charts/model-engine)values.yaml: add optionaldatadog.env(default"")._helpers.tpl:baseLabels:tags.datadoghq.com/env→datadog.env | default context.serviceEnvBase: remove the hardcodedDD_ENV(moved into the wrappers, likeGIT_TAG).serviceEnvGitTagFromHelmVar(control-plane): addDD_ENV=datadog.env | default context.serviceEnvGitTagFromPythonReplace(endpoints): addDD_ENV=${DD_ENV}.baseServiceTemplateEnv/baseForwarderTemplateEnv(legacy endpoint paths):DD_ENV→${DD_ENV}.celery_autoscaler_stateful_set.yaml:$env→datadog.env | default context.Chart.yaml:0.2.6→0.2.7.Server (
model-engine)common/env_vars.py: addDD_ENV(os.environ.get("DD_ENV") or infra_config().env).infra/gateways/resources/k8s_resource_types.py: addDD_ENVto_BaseDeploymentArgumentsand pass it in all 11 deployment-argument constructors (alongsideDD_TRACE_ENABLED).Validation
helm template(withvalues_sample.yaml): control-planeDD_ENV/label render todatadog.envwhen set and fall back tocontextwhen unset; inferenceDD_ENVrenders to${DD_ENV}for runtime substitution. No<no value>/ unrendered{{ }}.python -m astparse clean on both edited modules.safe_substitute, andget_endpoint_resource_arguments_from_requestnow suppliesDD_ENV, so${DD_ENV}resolves at endpoint creation. The existingtest_k8s_endpoint_resource_delegate.pytests render the real chart and assert structurally (no golden env-block comparison), so they remain compatible.model-enginetest suite locally (heavy service deps) — CI should confirm.Out of scope / known limitations
tags.datadoghq.com/env(baseTemplateLabels) is intentionally left oncontextfor now — that helper is shared with job templates, so making it${DD_ENV}would require threadingDD_ENVthrough the job argument classes too. TheDD_ENVenv var is the authoritative APMenvsource, so this is sufficient to fix env tagging on traces/metrics; the label can follow up.service_template_config_map_<env>.yamlinto an image (e.g. model-engine-internal'sjust autogen-templates+ image build) must regenerate templates and rebuild the image to pick up${DD_ENV}; and the rendered config map'sDD_ENVline will become a runtime placeholder.Draft pending CI + downstream coordination.